In [1]:
import warnings
warnings.filterwarnings("ignore")
import os
import time
import re
import pandas as pd 
import numpy as np
import yellowbrick
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.font_manager import FontProperties
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
from sklearn import metrics, preprocessing
from sklearn.svm import SVC
from sklearn.metrics import average_precision_score, precision_score, recall_score, f1_score, confusion_matrix, accuracy_score, classification_report, roc_curve, auc, roc_auc_score, silhouette_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from scipy.cluster.hierarchy import linkage, dendrogram, cut_tree
from sklearn import decomposition
import scipy.stats as stats
from scipy.linalg import eigh
from math import factorial as f
from pylab import rcParams
rcParams['figure.figsize'] = 10, 15
%matplotlib inline


THIS NOTEBOOK COVERS ONLY THE SCENARIO WITH OUTLIER DETECTION AND IMPUTATION.
THIS IS NOT THE ACTUAL PROJECT. PLEASE REFER TO THE OTHER FILE FOR "PART A" AND THE FULL PROJECT



PART B¶


QUESTION 1¶


*SOLUTION (1 A.)*¶

In [2]:
vehicle = pd.read_csv("C:/Users/pri96/OneDrive/Documents/AI and ML PGP/Module 5 - Unsupervised Learning (Week 17 to Week 19)/Project/vehicle.csv")
vehicle.head()
Out[2]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
In [3]:
print("There are", vehicle.shape[0], "rows and", vehicle.shape[1], "columns in the dataframe")
There are 846 rows and 19 columns in the dataframe

*SOLUTION (1 B.)*¶

In [4]:
vehicle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   compactness                  846 non-null    int64  
 1   circularity                  841 non-null    float64
 2   distance_circularity         842 non-null    float64
 3   radius_ratio                 840 non-null    float64
 4   pr.axis_aspect_ratio         844 non-null    float64
 5   max.length_aspect_ratio      846 non-null    int64  
 6   scatter_ratio                845 non-null    float64
 7   elongatedness                845 non-null    float64
 8   pr.axis_rectangularity       843 non-null    float64
 9   max.length_rectangularity    846 non-null    int64  
 10  scaled_variance              843 non-null    float64
 11  scaled_variance.1            844 non-null    float64
 12  scaled_radius_of_gyration    844 non-null    float64
 13  scaled_radius_of_gyration.1  842 non-null    float64
 14  skewness_about               840 non-null    float64
 15  skewness_about.1             845 non-null    float64
 16  skewness_about.2             845 non-null    float64
 17  hollows_ratio                846 non-null    int64  
 18  class                        846 non-null    object 
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB

Based on the above information, we can infer the following:

  • All features are numerical except class. It could be converted with label encoding, but we won't do that now and will revisit it if required later
  • Five features (class, hollows_ratio, max.length_rectangularity, max.length_aspect_ratio, compactness) have no null values; all remaining features have nulls and require imputation. We'll impute them with their respective median values
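The class-wise median imputation used later in this notebook can be sketched on a toy frame (the values below are hypothetical, not the vehicle data):

```python
import pandas as pd

# Toy frame standing in for the vehicle data (values are hypothetical)
df = pd.DataFrame({
    "circularity": [48.0, None, 50.0, 41.0, None],
    "class": ["van", "van", "car", "van", "car"],
})

# Each NaN is filled with the median of its own class, not the global median
df["circularity"] = df["circularity"].fillna(
    df.groupby("class")["circularity"].transform("median")
)
print(df["circularity"].tolist())  # [48.0, 44.5, 50.0, 41.0, 50.0]
```

Grouping by class keeps imputed values consistent with each vehicle type's typical range instead of pulling them toward the overall median.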
In [5]:
# Check percentage of missing values in each column
missing_percentages = vehicle.isnull().mean() * 100

# Print missing percentages
print("Percentage of missing values in each column:")
print(missing_percentages)
vehicle.isnull().sum()
Percentage of missing values in each column:
compactness                    0.000000
circularity                    0.591017
distance_circularity           0.472813
radius_ratio                   0.709220
pr.axis_aspect_ratio           0.236407
max.length_aspect_ratio        0.000000
scatter_ratio                  0.118203
elongatedness                  0.118203
pr.axis_rectangularity         0.354610
max.length_rectangularity      0.000000
scaled_variance                0.354610
scaled_variance.1              0.236407
scaled_radius_of_gyration      0.236407
scaled_radius_of_gyration.1    0.472813
skewness_about                 0.709220
skewness_about.1               0.118203
skewness_about.2               0.118203
hollows_ratio                  0.000000
class                          0.000000
dtype: float64
Out[5]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64
In [6]:
columns_with_unexpected_values = []
for column in vehicle.columns:
    unique_values = vehicle[column].unique()
    unexpected_values = []
    
    for value in unique_values:
        if pd.isna(value):  # Checking for NaN values
            unexpected_values.append(value)
        elif not pd.api.types.is_numeric_dtype(vehicle[column]) and not isinstance(value, str):
            unexpected_values.append(value)  # Checking for non-string non-numeric values, which is highly unlikely            
    
    if unexpected_values:
        print(f"Column '{column}' has unexpected values: {unexpected_values}")
        columns_with_unexpected_values.append(column)

# Checking for unexpected values across all datapoints (rows)
unexpected_rows = pd.DataFrame(vehicle[vehicle.isnull().any(axis = 1)])
if not unexpected_rows.empty:
    print(f"\nThese unexpected values occur across the {len(unexpected_rows)} rows below:\n\n")
else:
    print("No unexpected values found across datapoints.")
unexpected_rows.head()
Column 'circularity' has unexpected values: [nan]
Column 'distance_circularity' has unexpected values: [nan]
Column 'radius_ratio' has unexpected values: [nan]
Column 'pr.axis_aspect_ratio' has unexpected values: [nan]
Column 'scatter_ratio' has unexpected values: [nan]
Column 'elongatedness' has unexpected values: [nan]
Column 'pr.axis_rectangularity' has unexpected values: [nan]
Column 'scaled_variance' has unexpected values: [nan]
Column 'scaled_variance.1' has unexpected values: [nan]
Column 'scaled_radius_of_gyration' has unexpected values: [nan]
Column 'scaled_radius_of_gyration.1' has unexpected values: [nan]
Column 'skewness_about' has unexpected values: [nan]
Column 'skewness_about.1' has unexpected values: [nan]
Column 'skewness_about.2' has unexpected values: [nan]

These unexpected values occur across the 33 rows below:


Out[6]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
5 107 NaN 106.0 172.0 50.0 6 255.0 26.0 28.0 169 280.0 957.0 264.0 85.0 5.0 9.0 181.0 183 bus
9 93 44.0 98.0 NaN 62.0 11 183.0 36.0 22.0 146 202.0 505.0 152.0 64.0 4.0 14.0 195.0 204 car
19 101 56.0 100.0 215.0 NaN 10 208.0 32.0 24.0 169 227.0 651.0 223.0 74.0 6.0 5.0 186.0 193 car
35 100 46.0 NaN 172.0 67.0 9 157.0 43.0 20.0 150 170.0 363.0 184.0 67.0 17.0 7.0 192.0 200 van
66 81 43.0 68.0 125.0 57.0 8 149.0 46.0 19.0 146 169.0 323.0 172.0 NaN NaN 18.0 179.0 184 bus
In [7]:
# Imputing the missing values above with the median of the column within each class
for column in columns_with_unexpected_values:
    vehicle[column] = vehicle[column].fillna(vehicle.groupby('class')[column].transform('median'))

vehicle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   compactness                  846 non-null    int64  
 1   circularity                  846 non-null    float64
 2   distance_circularity         846 non-null    float64
 3   radius_ratio                 846 non-null    float64
 4   pr.axis_aspect_ratio         846 non-null    float64
 5   max.length_aspect_ratio      846 non-null    int64  
 6   scatter_ratio                846 non-null    float64
 7   elongatedness                846 non-null    float64
 8   pr.axis_rectangularity       846 non-null    float64
 9   max.length_rectangularity    846 non-null    int64  
 10  scaled_variance              846 non-null    float64
 11  scaled_variance.1            846 non-null    float64
 12  scaled_radius_of_gyration    846 non-null    float64
 13  scaled_radius_of_gyration.1  846 non-null    float64
 14  skewness_about               846 non-null    float64
 15  skewness_about.1             846 non-null    float64
 16  skewness_about.2             846 non-null    float64
 17  hollows_ratio                846 non-null    int64  
 18  class                        846 non-null    object 
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB

Now all columns have non-null values. Let's check the 5-point summary for the dataframe

In [8]:
vehicle.describe(include = 'all').T
Out[8]:
count unique top freq mean std min 25% 50% 75% max
compactness 846.0 NaN NaN NaN 93.678487 8.234474 73.0 87.0 93.0 100.0 119.0
circularity 846.0 NaN NaN NaN 44.826241 6.13434 33.0 40.0 44.0 49.0 59.0
distance_circularity 846.0 NaN NaN NaN 82.066194 15.754263 40.0 70.0 80.0 98.0 112.0
radius_ratio 846.0 NaN NaN NaN 168.916076 33.427561 104.0 141.0 167.25 195.0 333.0
pr.axis_aspect_ratio 846.0 NaN NaN NaN 61.680851 7.882557 47.0 57.0 61.0 65.0 138.0
max.length_aspect_ratio 846.0 NaN NaN NaN 8.567376 4.601217 2.0 7.0 8.0 10.0 55.0
scatter_ratio 846.0 NaN NaN NaN 168.920804 33.199802 112.0 147.0 157.0 198.0 265.0
elongatedness 846.0 NaN NaN NaN 40.927896 7.813401 26.0 33.0 43.0 46.0 61.0
pr.axis_rectangularity 846.0 NaN NaN NaN 20.579196 2.590879 17.0 19.0 20.0 23.0 29.0
max.length_rectangularity 846.0 NaN NaN NaN 147.998818 14.515652 118.0 137.0 146.0 159.0 188.0
scaled_variance 846.0 NaN NaN NaN 188.643026 31.37802 130.0 167.0 179.0 217.0 320.0
scaled_variance.1 846.0 NaN NaN NaN 439.665485 176.492876 184.0 318.25 364.0 586.75 1018.0
scaled_radius_of_gyration 846.0 NaN NaN NaN 174.712766 32.546284 109.0 149.0 174.0 198.0 268.0
scaled_radius_of_gyration.1 846.0 NaN NaN NaN 72.443262 7.470873 59.0 67.0 71.0 75.0 135.0
skewness_about 846.0 NaN NaN NaN 6.356974 4.904073 0.0 2.0 6.0 9.0 22.0
skewness_about.1 846.0 NaN NaN NaN 12.604019 8.930921 0.0 5.0 11.0 19.0 41.0
skewness_about.2 846.0 NaN NaN NaN 188.919622 6.152167 176.0 184.0 188.0 193.0 206.0
hollows_ratio 846.0 NaN NaN NaN 195.632388 7.438797 181.0 190.25 197.0 201.0 211.0
class 846 3 car 429 NaN NaN NaN NaN NaN NaN NaN
  • There are 846 instances with 19 attributes (columns): 18 numerical and 1 categorical
  • Some attributes, such as radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scaled_variance.1, scaled_radius_of_gyration, and skewness_about.1, have relatively high standard deviations compared to their means, suggesting potential outliers or significant variability in the data
  • The class attribute has 3 unique classes (car, van, bus), with car being the most frequent (429 instances). This suggests an imbalance where one class (car) dominates
  • compactness and circularity have nearly equal mean and median values, suggesting roughly symmetric distributions with little skew
  • We can gain further insights through EDA
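The mean-vs-median symmetry argument can also be checked with a skewness statistic. A sketch on synthetic stand-ins (random data, not the actual columns), using scipy's `stats.skew`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
symmetric = rng.normal(94, 8, size=846)    # stand-in for a symmetric feature like compactness
right_tail = rng.exponential(5, size=846)  # stand-in for a right-skewed feature

# |skew| near 0 suggests symmetry; large positive skew means a long right tail
print(round(float(stats.skew(symmetric)), 2), round(float(stats.skew(right_tail)), 2))
```

Applying `stats.skew` column-wise to the dataframe would quantify which features deviate most from symmetry.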

*SOLUTION (1 C.)*¶

In [9]:
# Count the occurrences of each class
class_counts = vehicle['class'].value_counts()

# Plotting a pie chart
plt.figure(figsize = (8, 6))
plt.pie(class_counts, labels = class_counts.index, autopct = '%1.1f%%', startangle = 140)
plt.title('Distribution of Classes')
plt.show()

# Print percentage of values for each class
print("Percentage of values for variable 'class':")
print(class_counts / len(vehicle) * 100)
Percentage of values for variable 'class':
class
car    50.709220
bus    25.768322
van    23.522459
Name: count, dtype: float64

Based on above pie-chart, we see that:

  • Approximately 50.7% of the vehicles in the dataset are classified as cars; buses and vans account for ~25.8% and ~23.5% respectively
  • The dataset is imbalanced towards cars, which constitute more than half of the vehicles. Buses and vans make up the remainder, with buses slightly more frequent than vans

Models trained on this dataframe may therefore be biased towards predicting 'car' instances more accurately due to their higher representation in the dataset
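One hedge against this imbalance at evaluation time is a stratified split, which `train_test_split` supports via the `stratify` argument. A toy sketch (the labels here are hypothetical class codes, not the real column):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.array([1] * 10 + [0] * 5 + [2] * 5)  # imbalanced toy labels (50/25/25)

# stratify=y preserves the 50/25/25 class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(np.bincount(y_te))  # [1 2 1]: proportions preserved in the 4 test samples
```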

*SOLUTION (1 D.)*¶

In [10]:
duplicate_rows = vehicle[vehicle.duplicated()]

if not duplicate_rows.empty:
    print(f"Number of duplicate rows: {len(duplicate_rows)}")
    print("Duplicate rows:")
    print(duplicate_rows)
else:
    print("No duplicate rows found.")
No duplicate rows found.

There are no duplicate rows, so no further imputation/correction steps are required

Before proceeding to the next parts, let's analyze the given dataset

PAIR PLOT¶

In [11]:
sns.pairplot(vehicle, diag_kind = 'kde', hue = 'class')
Out[11]:
<seaborn.axisgrid.PairGrid at 0x2e37a01a1d0>

The pair plot shows that:

  • 'compactness' is least spread for vans and most spread for cars, and is right-skewed for buses, indicating that few buses have high compactness
  • 'scaled_radius_of_gyration', 'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1', 'skewness_about.2', and 'pr.axis_aspect_ratio' have broadly similar distributions for cars, buses and vans
  • 'max.length_aspect_ratio' has a similar distribution for cars and vans but is lower for buses
  • While 'hollows_ratio' is lower for buses compared to cars and vans, mean 'elongatedness' is highest for vans, followed by buses and then cars
  • 'pr.axis_rectangularity' and mean 'scaled_variance' are both highest for cars, followed by buses and then vans
  • Many columns have long tails, indicating outliers

CORRELATION & HEAT MAP¶

In [12]:
plt.figure(figsize = (15, 8))
sns.heatmap(vehicle.select_dtypes(['float64', 'int64']).corr(), cmap = 'coolwarm', annot = True, fmt = ".2f")
Out[12]:
<Axes: >
  • A few variables, such as 'skewness_about', 'skewness_about.1', 'skewness_about.2', and 'hollows_ratio', exhibit weak relationships with almost all other attributes
  • Many columns (e.g., 'circularity' and 'max.length_rectangularity') are highly correlated. Since multiple features are highly correlated with one another, we face the risk of multicollinearity; PCA can address such features
  • The strongest correlation, 0.99, is found between pairs such as 'scatter_ratio' and 'scaled_variance.1'
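Rather than reading pairs off the heatmap, highly correlated pairs can be listed programmatically. A sketch on synthetic data, with a 0.95 threshold chosen arbitrarily:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # near-duplicate of 'a'
    "c": rng.normal(size=200),                      # independent noise
})

# Keep only the upper triangle so each pair is reported once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(i, j) for j in upper.columns for i in upper.index if upper.loc[i, j] > 0.95]
print(pairs)  # [('a', 'b')]
```

On the vehicle data the same loop would flag pairs such as 'scatter_ratio' and 'scaled_variance.1' as PCA candidates.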

BOX PLOT¶

In [13]:
plt.figure(figsize = (20,15))
sns.boxplot(vehicle, orient = 'h')
plt.title(f"Box Plot for various features combined")
plt.xticks(rotation = 90)
plt.show()

We see there are many features with outliers, such as radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scaled_variance, scaled_variance.1, skewness_about, and skewness_about.1.

Let's see if we can treat those outliers so they do not affect our final predictions. We can impute the outliers with the median of the respective column

In [14]:
# finding the outliers and replace them by median
outliers = pd.DataFrame()
for col_name in vehicle.columns[:-1]:
    q1 = vehicle[col_name].quantile(0.25)
    q3 = vehicle[col_name].quantile(0.75)
    iqr = q3 - q1
    
    # Defining outlier boundaries
    lower_bound = q1 - 1.5 * iqr
    higher_bound = q3 + 1.5 * iqr
    
    # Finding rows with outliers in this column
    outlier_rows = vehicle[((vehicle[col_name] < lower_bound) | (vehicle[col_name] > higher_bound))]
    
    # Append to the outliers DataFrame (a row outlying in several columns is appended once per column)
    outliers = pd.concat([outliers, outlier_rows])

print("There are", outliers.shape[0], "rows with outliers which contribute to", 
      format(outliers.shape[0]*100/vehicle.shape[0], '.2f'), "% of overall data. A few records are - ")
outliers.head()
There are 55 rows with outliers which contribute to 6.50 % of overall data. A few records are - 
Out[14]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
37 90 48.0 86.0 306.0 126.0 49 153.0 44.0 19.0 156 272.0 346.0 200.0 118.0 0.0 15.0 185.0 194 van
135 89 47.0 83.0 322.0 133.0 48 158.0 43.0 20.0 163 229.0 364.0 176.0 97.0 0.0 14.0 184.0 194 van
388 94 47.0 85.0 333.0 138.0 49 155.0 43.0 19.0 155 320.0 354.0 187.0 135.0 12.0 9.0 188.0 196 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
37 90 48.0 86.0 306.0 126.0 49 153.0 44.0 19.0 156 272.0 346.0 200.0 118.0 0.0 15.0 185.0 194 van

Let's impute the values with median grouped by their class values now

In [15]:
# Function to find outliers and replace them with median values grouped by 'class'
def replace_outliers_with_median(vehicle):
    outliers = pd.DataFrame()
    
    # Finding outliers
    for col_name in vehicle.columns[:-1]:  # Excluding the 'class' column
        q1 = vehicle[col_name].quantile(0.25)
        q3 = vehicle[col_name].quantile(0.75)
        iqr = q3 - q1

        # Defining outlier boundaries
        lower_bound = q1 - 1.5 * iqr
        higher_bound = q3 + 1.5 * iqr

        # Finding rows with outliers
        outlier_rows = vehicle[((vehicle[col_name] < lower_bound) | (vehicle[col_name] > higher_bound))]
        
        # Append to outliers DataFrame
        outliers = pd.concat([outliers, outlier_rows])
    print("There are", outliers.shape[0], "rows with outliers which contribute to", 
    format(outliers.shape[0]*100/vehicle.shape[0], '.2f'), "% of overall data")

    # Replace values in the flagged outlier rows with class-wise medians.
    # Note: the isin(...) filter is effectively always true, so every feature of
    # each flagged row is set to its class median, not only the outlying cells.
    for col_name in vehicle.columns[:-1]:  # Excluding the 'class' column
        for cls in vehicle['class'].unique():
            # Get the class-specific data and its median for this column
            grouped_by_class = vehicle[vehicle['class'] == cls]
            median_value = grouped_by_class[col_name].median()

            vehicle.loc[outliers[(outliers['class'] == cls) & (outliers[col_name].isin(grouped_by_class[col_name]))].index, col_name] = median_value

    return vehicle

# Example usage:
# Replace outliers in the 'vehicle' DataFrame
vehicle = replace_outliers_with_median(vehicle)

print("Outliers have been replaced with median values grouped by class.")
vehicle.head()
There are 55 rows with outliers which contribute to 6.50 % of overall data
Outliers have been replaced with median values grouped by class.
Out[15]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 89 44.0 72.0 167.5 64.0 6 152.0 44.0 19.0 145 177.0 344.0 176.0 76.0 5.0 10.0 186.0 189 bus

Let's plot the box plot again to see whether the outliers have improved

In [16]:
plt.figure(figsize = (20,15))
sns.boxplot(vehicle, orient = 'h')
plt.title(f"Box Plot for various features combined - After imputing Outliers with medians")
plt.xticks(rotation = 90)
plt.show()

The box plot is now much improved: a few outliers remain, but they are far less extreme and lie close to the bulk of the data.
Let's check the 5-point summary again

In [17]:
vehicle.describe(include = 'all').T
Out[17]:
count unique top freq mean std min 25% 50% 75% max
compactness 846.0 NaN NaN NaN 93.567376 7.937452 73.0 88.0 93.0 99.0 116.0
circularity 846.0 NaN NaN NaN 44.776596 6.000471 33.0 40.0 44.0 49.0 59.0
distance_circularity 846.0 NaN NaN NaN 82.098109 15.376184 40.0 70.0 79.0 96.0 112.0
radius_ratio 846.0 NaN NaN NaN 167.925532 31.150509 104.0 142.0 167.5 193.0 246.0
pr.axis_aspect_ratio 846.0 NaN NaN NaN 61.190307 5.530374 47.0 57.0 61.0 65.0 76.0
max.length_aspect_ratio 846.0 NaN NaN NaN 8.120567 2.026365 3.0 7.0 8.0 10.0 13.0
scatter_ratio 846.0 NaN NaN NaN 168.6513 32.383284 112.0 146.0 157.0 196.75 262.0
elongatedness 846.0 NaN NaN NaN 40.917258 7.635344 26.0 33.25 43.0 46.0 61.0
pr.axis_rectangularity 846.0 NaN NaN NaN 20.549645 2.532958 17.0 19.0 20.0 23.0 28.0
max.length_rectangularity 846.0 NaN NaN NaN 147.916076 14.2083 118.0 137.0 146.0 158.75 188.0
scaled_variance 846.0 NaN NaN NaN 187.841608 29.941339 130.0 167.0 178.0 214.0 288.0
scaled_variance.1 846.0 NaN NaN NaN 437.725768 171.764061 184.0 318.0 364.5 576.0 987.0
scaled_radius_of_gyration 846.0 NaN NaN NaN 174.334515 31.80001 109.0 150.0 174.0 196.75 268.0
scaled_radius_of_gyration.1 846.0 NaN NaN NaN 72.021277 6.094873 59.0 68.0 71.0 75.0 87.0
skewness_about 846.0 NaN NaN NaN 6.153664 4.545109 0.0 2.0 6.0 9.0 19.0
skewness_about.1 846.0 NaN NaN NaN 12.567376 8.761107 0.0 6.0 11.0 18.0 40.0
skewness_about.2 846.0 NaN NaN NaN 188.938534 6.001657 176.0 185.0 188.0 193.0 204.0
hollows_ratio 846.0 NaN NaN NaN 195.661939 7.243468 181.0 191.0 197.0 201.0 211.0
class 846 3 car 429 NaN NaN NaN NaN NaN NaN NaN

We see an overall improvement in the outliers, which is evident from both the box plot and the 5-point summary. Let's proceed with the next steps.

QUESTION 2¶


*SOLUTION (2 A.)*¶

Before proceeding, let's encode the 'class' feature; it is the only non-numerical column, and a numeric target is easier to work with when splitting
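As an alternative to a manual replacement mapping, sklearn's `LabelEncoder` assigns integer codes in alphabetical order of the labels, which happens to match the bus=0, car=1, van=2 scheme used in this notebook. A quick sketch:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["van", "car", "bus", "van"])

# classes_ are sorted alphabetically, so bus -> 0, car -> 1, van -> 2
print(list(codes), list(le.classes_))  # [2, 1, 0, 2] ['bus', 'car', 'van']
```

An encoder object also provides `inverse_transform`, which is handy for mapping predictions back to class names.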

In [18]:
vehicle['class'] = vehicle['class'].replace(['bus','car','van'], [0, 1, 2])
vehicle.sample(5, random_state = 42)
Out[18]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
39 81 45.0 68.0 169.0 73.0 6 151.0 44.0 19.0 146 173.0 336.0 186.0 75.0 7.0 0.0 183.0 189 0
250 95 38.0 66.0 126.0 52.0 8 133.0 52.0 18.0 140 158.0 253.0 140.0 78.0 11.0 8.0 184.0 183 2
314 90 42.0 63.0 126.0 55.0 7 152.0 45.0 19.0 142 173.0 336.0 173.0 81.0 0.0 15.0 180.0 184 0
96 89 42.0 80.0 151.0 62.0 6 144.0 46.0 19.0 139 166.0 308.0 170.0 74.0 17.0 13.0 185.0 189 1
198 81 46.0 71.0 130.0 56.0 7 153.0 44.0 19.0 149 172.0 342.0 191.0 81.0 3.0 14.0 180.0 186 0
In [19]:
# Splitting data into X and Y

X = vehicle.drop('class',axis = 1)  # All independent variables (i.e., excluding 'class' as that is target variable)
y = vehicle['class'] # Target variable
In [20]:
print("Shape of X -->", X.shape)
X.head()
Shape of X --> (846, 18)
Out[20]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207
4 89 44.0 72.0 167.5 64.0 6 152.0 44.0 19.0 145 177.0 344.0 176.0 76.0 5.0 10.0 186.0 189
In [21]:
print("Shape of y -->", y.shape[0])
y.head()
Shape of y --> 846
Out[21]:
0    2
1    2
2    1
3    2
4    0
Name: class, dtype: int64
In [22]:
# Optionally, splitting X and Y into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Shapes of the resulting datasets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("Y_train shape:", y_train.shape[0])
print("Y_test shape:", y_test.shape[0])
X_train shape: (676, 18)
X_test shape: (170, 18)
Y_train shape: 676
Y_test shape: 170

*SOLUTION (2 B.)*¶

We'll be using Standard Scaler to standardize the data

In [23]:
# Standardizing the data
standard_scaler = StandardScaler()

# Fit the scaler on the full data (kept for later unsupervised steps), then fit on the
# training split and apply that same fit to transform the test split
X_scaled = standard_scaler.fit_transform(X)
X_train_scaled = standard_scaler.fit_transform(X_train)
X_test_scaled = standard_scaler.transform(X_test)
In [24]:
# Converting the scaled data back to DataFrame for better readability
X_scaled_df = pd.DataFrame(X_scaled, columns = X_train.columns)
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns = X_train.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns = X_test.columns)

# Shapes of the resulting datasets to ensure correctness
print("X_scaled_df shape:", X_scaled_df.shape)
print("X_train_scaled_df shape:", X_train_scaled_df.shape)
print("X_test_scaled_df shape:", X_test_scaled_df.shape)

print("\nScaled Training Set - ")
X_train_scaled_df.head()
X_scaled_df shape: (846, 18)
X_train_scaled_df shape: (676, 18)
X_test_scaled_df shape: (170, 18)

Scaled Training Set - 
Out[24]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
0 -0.447383 0.367465 0.185274 -0.237831 0.487192 0.904825 -0.167288 0.135616 -0.205454 0.844217 -0.351616 -0.276018 0.355914 0.198339 0.621442 -1.460108 -0.668052 -0.135318
1 1.069828 1.194934 1.239419 1.422429 0.487192 0.904825 1.098723 -1.060166 1.012853 0.635555 0.920385 1.085722 0.387751 -1.132454 -1.360137 0.703190 0.644271 0.980233
2 1.449131 1.360428 1.700607 1.198933 0.305436 1.896174 1.700079 -1.325895 1.825058 1.400648 1.332926 1.661612 0.865317 -0.134359 1.942494 2.524914 -0.011891 0.980233
3 2.713474 1.194934 1.173535 1.103148 0.123680 0.904825 1.541827 -1.325895 1.418956 1.191986 1.436061 1.571629 1.279207 0.031990 1.722319 -1.004677 -0.175931 0.143570
4 1.954868 1.691415 0.909999 0.720011 -0.785100 -1.077871 2.301434 -1.724490 2.231161 1.539756 2.742440 2.531446 2.170663 2.194529 0.401266 -0.435388 -0.832093 -1.669202
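As a quick sanity check, after `fit_transform` each column should have mean ~0 and (population) standard deviation ~1. Sketched on synthetic data rather than the vehicle features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(loc=50, scale=10, size=(100, 3))  # arbitrary toy matrix

Xs = StandardScaler().fit_transform(X)
# Each column is centered and scaled to unit variance
print(np.allclose(Xs.mean(axis=0), 0, atol=1e-9), np.allclose(Xs.std(axis=0), 1))  # True True
```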

QUESTION 3¶


In [25]:
# Create Confusion Matrix
def plot_confusion_matrix(conf_matrix_train, conf_matrix_test):
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    sns.heatmap(conf_matrix_train, annot = True, fmt = 'd', cmap = 'Blues', ax = axes[0])
    axes[0].set_title('Confusion Matrix for Training Set')
    axes[0].set_xlabel('Predicted Labels')
    axes[0].set_ylabel('Actual Labels')
    axes[0].set_xticks(ticks = [0.5, 1.5, 2.5], labels = ['bus','car', 'van'])
    axes[0].set_yticks(ticks = [0.5, 1.5, 2.5], labels = ['bus','car', 'van'])

    sns.heatmap(conf_matrix_test, annot = True, fmt = 'd', cmap = 'Blues', ax = axes[1])
    axes[1].set_title('Confusion Matrix for Testing Set')
    axes[1].set_xlabel('Predicted Labels')
    axes[1].set_ylabel('Actual Labels')
    axes[1].set_xticks(ticks = [0.5, 1.5, 2.5], labels = ['bus','car', 'van'])
    axes[1].set_yticks(ticks = [0.5, 1.5, 2.5], labels = ['bus','car', 'van'])
    
    plt.tight_layout()
    plt.show()
    
    print("Confusion Matrix for Final Training Set:\n", conf_matrix_train)
    print("Confusion Matrix for Final Testing Set:\n", conf_matrix_test)
In [26]:
# Create Evaluation Metrics for checking performance of the model
def evaluation_metrics(y_train, y_train_pred, y_test, y_test_pred, conf_matrix_train, conf_matrix_test):
    
    # Compute evaluation metrics for training set
    accuracy_train = accuracy_score(y_train, y_train_pred)
    precision_train = precision_score(y_train, y_train_pred, average='macro')
    recall_train = recall_score(y_train, y_train_pred, average='macro')
    f1_train = f1_score(y_train, y_train_pred, average='macro')
    
    # Compute evaluation metrics for testing set
    accuracy_test = accuracy_score(y_test, y_test_pred)
    precision_test = precision_score(y_test, y_test_pred, average='macro')
    recall_test = recall_score(y_test, y_test_pred, average='macro')
    f1_test = f1_score(y_test, y_test_pred, average='macro')
    
    # Print evaluation metrics on train set
    print("Training Set:")
    print("    Accuracy:", format(accuracy_train, '.3f'))
    print("    Recall:", format(recall_train, '.3f'))
    print("    Precision:", format(precision_train, '.3f'))
    print("    F1 Score:", format(f1_train, '.3f'))
    
    # Print evaluation metrics on test set
    print("Testing Set:")
    print("    Accuracy:", format(accuracy_test, '.3f'))
    print("    Recall:", format(recall_test, '.3f'))
    print("    Precision:", format(precision_test, '.3f'))
    print("    F1 Score:", format(f1_test, '.3f'))
    
    print("\nClassification Report for Training Set:")
    print(classification_report(y_train, y_train_pred))
    
    print("\nClassification Report for Testing Set:")
    print(classification_report(y_test, y_test_pred))
    
    # Plot confusion matrix
    print("\n Confusion Matrix:")
    plot_confusion_matrix(conf_matrix_train, conf_matrix_test)

*SOLUTION (3 A.)*¶

In [27]:
# Train a base SVM model
svm_model = SVC(random_state = 42)
svm_model
Out[27]:
SVC(random_state=42)
In [28]:
svm_model.fit(X_train_scaled, y_train)
Out[28]:
SVC(random_state=42)
In [29]:
# Predicting the train and test set

y_test_pred = svm_model.predict(X_test_scaled)
y_train_pred = svm_model.predict(X_train_scaled)

print("Train Set Accuracy:", format(accuracy_score(y_train, y_train_pred), '.5f'))
print("Test Set Accuracy:", format(accuracy_score(y_test, y_test_pred), '.5f'))
Train Set Accuracy: 0.98669
Test Set Accuracy: 0.97647

The model has shown strong performance on both the training and test sets, indicating that the SVM classifier is working well with the current data.


Let's take a further look at the evaluation metrics

*SOLUTION (3 B.)*¶

In [30]:
# Evaluating the performance of the model on train and test set

conf_matrix_train = confusion_matrix(y_train, y_train_pred)
conf_matrix_test = confusion_matrix(y_test, y_test_pred)

evaluation_metrics(y_train, y_train_pred, y_test, y_test_pred, conf_matrix_train, conf_matrix_test)
Training Set:
    Accuracy: 0.987
    Recall: 0.987
    Precision: 0.985
    F1 Score: 0.986
Testing Set:
    Accuracy: 0.976
    Recall: 0.979
    Precision: 0.972
    F1 Score: 0.975

Classification Report for Training Set:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       166
           1       0.99      0.99      0.99       351
           2       0.97      0.98      0.98       159

    accuracy                           0.99       676
   macro avg       0.98      0.99      0.99       676
weighted avg       0.99      0.99      0.99       676


Classification Report for Testing Set:
              precision    recall  f1-score   support

           0       1.00      0.96      0.98        52
           1       0.99      0.97      0.98        78
           2       0.93      1.00      0.96        40

    accuracy                           0.98       170
   macro avg       0.97      0.98      0.97       170
weighted avg       0.98      0.98      0.98       170


 Confusion Matrix:
[Heatmaps of the training and testing confusion matrices]
Confusion Matrix for Final Training Set:
 [[165   0   1]
 [  2 346   3]
 [  0   3 156]]
Confusion Matrix for Final Testing Set:
 [[50  1  1]
 [ 0 76  2]
 [ 0  0 40]]

Based on the above Training set Classification and Evaluation Metrics, we can see that:

  • The trained SVM model achieves an accuracy of 98.7%, indicating that it correctly predicts the class for 98.7% of the instances in the training set
  • Precision is high across all classes: 99% for 'bus', 99% for 'car', and 97% for 'van', meaning that most of the model's positive predictions are indeed correct
  • Recall and F1-score are also high for all three classes (99% for 'bus', 99% for 'car', 98% for 'van'), which shows that the model correctly identifies the true positives out of all actual positives
  • The model performs exceptionally well across all evaluated metrics, indicating correctness in distinguishing between the different classes
  • Class 'car', with the highest number of instances (351), shows strong precision (99%) and recall (99%)
  • Based on the accuracy values on the testing set, we can say that the model performs well overall and does not overfit
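A note on the class labels: the classification reports show numeric classes 0/1/2 while the commentary refers to 'bus', 'car' and 'van'. Assuming the target was encoded earlier with sklearn's LabelEncoder (an assumption; the encoding step is in the Part A file), the mapping follows alphabetical order and can be checked like this:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative only: LabelEncoder assigns codes in sorted (alphabetical)
# order, so 'bus' -> 0, 'car' -> 1, 'van' -> 2
le = LabelEncoder()
codes = le.fit_transform(['van', 'car', 'bus', 'car'])
print(dict(zip(le.classes_, le.transform(le.classes_))))
```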
In [31]:
# Creating a table for final comparison of performance
final_comparison = pd.DataFrame({'Model': ['Base Classification SVM Model'],
                                 'Accuracy(Training Set)': format(accuracy_score(y_train, y_train_pred), '.4f'), 
                                 'Accuracy(Testing Set)': format(accuracy_score(y_test, y_test_pred), '.4f')})
final_comparison.style.set_properties(**{'text-align': 'center'}).set_table_styles([{
    'selector': 'th',
    'props': [('text-align', 'center')]
}])
Out[31]:
  Model Accuracy(Training Set) Accuracy(Testing Set)
0 Base Classification SVM Model 0.9867 0.9765

*SOLUTION (3 C.)*¶

In [32]:
# Initialize PCA with 10 components
pca = PCA(n_components = 10, random_state = 42)
pca
Out[32]:
PCA(n_components=10, random_state=42)
In [33]:
X_scaled_pca = pca.fit_transform(X_scaled)
X_scaled_pca_df = pd.DataFrame(data = X_scaled_pca, columns=[f'PC{i+1}' for i in range(10)])
X_scaled_pca_df.head()
Out[33]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
0 0.616786 -0.656198 -0.588809 0.675499 -0.834970 -1.875319 -0.159352 0.681260 0.300094 0.182714
1 -1.528457 -0.350607 -0.247520 -1.333410 -0.241597 -0.106052 0.207227 -0.123113 -0.162492 -0.357933
2 4.036602 0.253970 -1.251445 -0.166392 0.964006 -0.634244 0.846855 -0.145021 0.328844 -0.499040
3 -1.546115 -3.101297 -0.475628 -0.411329 -0.616389 0.373662 0.114184 0.189917 -0.351054 0.293456
4 -1.495061 0.933869 -0.118030 1.152771 0.392378 -0.422421 -0.296414 0.101058 0.264962 0.103133
In [34]:
train_components = pca.fit_transform(X_train_scaled)
train_components_df = pd.DataFrame(data = train_components, columns=[f'PC{i+1}' for i in range(10)])

train_components_df.head()
Out[34]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
0 0.078779 0.724597 1.881145 0.334867 -0.782546 -0.696754 0.367182 -0.190328 -0.169587 0.151643
1 3.661733 -1.306332 -1.137874 0.286916 -1.076376 0.207300 -0.244219 0.063574 0.290371 0.247014
2 5.349105 -0.343910 -0.481351 -2.351430 0.993502 -1.577530 0.086273 0.253858 -0.372301 0.107853
3 4.830518 0.559451 1.611400 -0.468141 0.952707 0.788349 0.944737 0.886320 0.408963 -0.115990
4 5.341616 3.967103 -0.357915 0.545154 1.949493 1.920468 -0.164908 0.440096 -0.211912 -0.156900
In [35]:
test_components = pca.transform(X_test_scaled)
test_components_df = pd.DataFrame(data = test_components, columns=[f'PC{i+1}' for i in range(10)])

test_components_df.head()
Out[35]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
0 -1.664303 1.004923 1.184971 2.663422 0.556912 -1.293941 -0.005094 -0.611842 0.076082 0.451762
1 -3.593773 1.739502 0.885594 -1.433330 0.396088 0.438236 0.930078 0.999913 0.301610 0.011017
2 -2.562531 2.743790 -1.087999 0.060070 -0.528041 0.301019 -0.274937 0.596780 0.117052 -0.304579
3 -2.023648 0.706398 1.194657 -0.575487 2.059044 -0.865997 0.286400 -0.386585 0.546185 0.209683
4 -2.093355 2.919766 -0.363971 0.082303 -0.429702 -0.371410 -0.884668 -0.519972 -0.192239 -0.178418

Above, we have applied PCA (Principal Component Analysis) on our training and test data. With this, we have reduced the dimensionality of our data while retaining the maximum amount of variance possible


We have also transformed our dataset along with its training and test sets, and we can now use them for further analysis, modeling, or visualization.

*SOLUTION (3 D.)*¶

To visualize the cumulative variance explained by the number of principal components as suggested in the question, we can plot a line graph that shows how the cumulative variance increases with each additional component.

Let's see the implementation of the same:

In [36]:
# Calculating the explained variance ratios
explained_variance_ratio = pca.explained_variance_ratio_
explained_variance_ratio
Out[36]:
array([0.54934096, 0.18430257, 0.06879469, 0.06433195, 0.04711015,
       0.0338286 , 0.01812279, 0.01267838, 0.00618011, 0.00415381])
In [37]:
cumulative_explained_variance = explained_variance_ratio.cumsum()
cumulative_explained_variance
Out[37]:
array([0.54934096, 0.73364353, 0.80243822, 0.86677017, 0.91388031,
       0.94770891, 0.9658317 , 0.97851008, 0.98469019, 0.988844  ])
In [38]:
# Printing explained variance ratios for each principal component
print("Explained Variance Ratios:")
for i, evr in enumerate(explained_variance_ratio):
    print(f"    PC{i+1}: {evr:.5f}")
Explained Variance Ratios:
    PC1: 0.54934
    PC2: 0.18430
    PC3: 0.06879
    PC4: 0.06433
    PC5: 0.04711
    PC6: 0.03383
    PC7: 0.01812
    PC8: 0.01268
    PC9: 0.00618
    PC10: 0.00415
In [39]:
# Printing explained variance ratios for each principal component
print("Cumulative Variance:")
for i, cumulative_var in enumerate(cumulative_explained_variance, 1):
    print(f"    Component {i}: {cumulative_var:.5f}")
Cumulative Variance:
    Component 1: 0.54934
    Component 2: 0.73364
    Component 3: 0.80244
    Component 4: 0.86677
    Component 5: 0.91388
    Component 6: 0.94771
    Component 7: 0.96583
    Component 8: 0.97851
    Component 9: 0.98469
    Component 10: 0.98884
In [40]:
overall_variance_explained = np.sum(explained_variance_ratio)
print(f"Overall Variance Explained by 10 PCA components: {overall_variance_explained:.4f}")
Overall Variance Explained by 10 PCA components: 0.9888
In [41]:
# Visualization of above
plt.figure(figsize = (10, 6))
plt.plot(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, marker = 'o',
         linestyle = '--', color = 'b')
plt.title('Cumulative Variance Explained by Number of Principal Components (Total = 10)')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid(True)
plt.show()
[Line plot: cumulative explained variance vs. number of principal components (total = 10)]
  • We see a steep initial rise in the cumulative variance
  • There is an elbow at around 3 or 4 principal components, after which the explained variance increases more slowly
  • The curve flattens after 6 components and approaches 1, which shows that the 10 components together capture almost all of the variance in our dataset

*SOLUTION (3 E.)*¶

In [42]:
# Adding a horizontal line to highlight the threshold of 90%
plt.figure(figsize = (10, 6))
plt.plot(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, marker = 'o',
         linestyle = '--', color = 'b')
plt.axhline(y = 0.90, color = 'r', linestyle = '-')
plt.title('Cumulative Variance Explained by Number of Principal Components (Total = 10)')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.xticks(range(1, len(cumulative_explained_variance) + 1))
plt.grid(True)
plt.show()
[Line plot: cumulative explained variance with 90% threshold line (total = 10)]

We've added a red horizontal line at 90% of the total variance. The threshold is crossed at the 5th principal component, so about 5 principal components are needed to capture approximately 90% of the total variance in the dataset, giving a good balance between dimensionality and information retained

*SOLUTION (3 F.)*¶

To apply PCA with the minimum number of components that explain 90% or more of the variance, let's first find that number of components

In [43]:
# Determine the number of components for 90% variance
num_components = np.argmax(cumulative_explained_variance >= 0.90) + 1
print("The suitable minimum number of components to explain 90% or more variance is -", num_components)
The suitable minimum number of components to explain 90% or more variance is - 5
In [44]:
pca_90 = PCA(n_components = num_components, random_state = 42)
pca_90
Out[44]:
PCA(n_components=5, random_state=42)
In [45]:
X_scaled_pca_90 = pca_90.fit_transform(X_scaled)
X_scaled_pca_90_df = pd.DataFrame(data = X_scaled_pca_90, columns=[f'PC_90_{i+1}' for i in range(num_components)])

print("There are", X_scaled_pca_90_df.shape[0], "rows and", X_scaled_pca_90_df.shape[1], "columns in the below dataframe")
X_scaled_pca_90_df.head()
There are 846 rows and 5 columns in the below dataframe
Out[45]:
PC_90_1 PC_90_2 PC_90_3 PC_90_4 PC_90_5
0 0.616786 -0.656198 -0.588809 0.675499 -0.834970
1 -1.528457 -0.350607 -0.247520 -1.333410 -0.241597
2 4.036602 0.253970 -1.251445 -0.166392 0.964006
3 -1.546115 -3.101297 -0.475628 -0.411329 -0.616389
4 -1.495061 0.933869 -0.118030 1.152771 0.392378
In [46]:
train_components_90 = pca_90.fit_transform(X_train_scaled)
train_components_90_df = pd.DataFrame(data = train_components_90, columns=[f'PC_90_{i+1}' for i in range(num_components)])

print("There are", train_components_90_df.shape[0], "rows and", train_components_90_df.shape[1], "columns in the below dataframe")
train_components_90_df.head()
There are 676 rows and 5 columns in the below dataframe
Out[46]:
PC_90_1 PC_90_2 PC_90_3 PC_90_4 PC_90_5
0 0.078779 0.724597 1.881145 0.334867 -0.782546
1 3.661733 -1.306332 -1.137874 0.286916 -1.076376
2 5.349105 -0.343910 -0.481351 -2.351430 0.993502
3 4.830518 0.559451 1.611400 -0.468141 0.952707
4 5.341616 3.967103 -0.357915 0.545154 1.949493
In [47]:
test_components_90 = pca_90.transform(X_test_scaled)
test_components_90_df = pd.DataFrame(data = test_components_90, columns=[f'PC_90_{i+1}' for i in range(num_components)])

print("There are", test_components_90_df.shape[0], "rows and", test_components_90_df.shape[1], "columns in the below dataframe")
test_components_90_df.head()
There are 170 rows and 5 columns in the below dataframe
Out[47]:
PC_90_1 PC_90_2 PC_90_3 PC_90_4 PC_90_5
0 -1.664303 1.004923 1.184971 2.663422 0.556912
1 -3.593773 1.739502 0.885594 -1.433330 0.396088
2 -2.562531 2.743790 -1.087999 0.060070 -0.528041
3 -2.023648 0.706398 1.194657 -0.575487 2.059044
4 -2.093355 2.919766 -0.363971 0.082303 -0.429702

Now that we have applied PCA with the minimum number of components, which is 5, let's plot it similar to the above

In [48]:
explained_variance_ratio_90 = pca_90.explained_variance_ratio_
cumulative_explained_variance_90 = explained_variance_ratio_90.cumsum()

# Adding a horizontal line to highlight the threshold of 90%
plt.figure(figsize = (10, 6))
plt.plot(range(1, len(cumulative_explained_variance_90) + 1), cumulative_explained_variance_90, marker = 'o',
         linestyle = '--', color = 'b')
plt.axhline(y = 0.90, color = 'r', linestyle = '-')
plt.title('Cumulative Variance Explained by Number of Principal Components (Total = 5)')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.xticks(range(1, len(cumulative_explained_variance_90) + 1))
plt.grid(True)
plt.show()
[Line plot: cumulative explained variance with 90% threshold line (total = 5)]
In [49]:
overall_variance_explained_90 = np.sum(explained_variance_ratio_90)
print(f"Overall Variance Explained by minimum number of PCA components : {overall_variance_explained_90:.4f}")
Overall Variance Explained by minimum number of PCA components : 0.9139

*SOLUTION (3 G.)*¶

In [50]:
# Initializing the SVM classifier
svm_pca_90 = SVC(random_state = 42)
svm_pca_90
Out[50]:
SVC(random_state=42)
In [51]:
# Training SVM on the PCA-transformed training data
svm_pca_90.fit(train_components_90_df, y_train)
Out[51]:
SVC(random_state=42)
In [52]:
# Predicting the train and test set

y_test_pca_90_pred = svm_pca_90.predict(test_components_90)
y_train_pca_90_pred = svm_pca_90.predict(train_components_90)

print("Train Set Accuracy:", format(accuracy_score(y_train, y_train_pca_90_pred), '.5f'))
print("Test Set Accuracy:", format(accuracy_score(y_test, y_test_pca_90_pred), '.5f'))
Train Set Accuracy: 0.91272
Test Set Accuracy: 0.89412

*SOLUTION (3 H.)*¶

In [53]:
# Evaluating the performance of the model on train and test set

conf_matrix_train = confusion_matrix(y_train, y_train_pca_90_pred)
conf_matrix_test = confusion_matrix(y_test, y_test_pca_90_pred)

evaluation_metrics(y_train, y_train_pca_90_pred, y_test, y_test_pca_90_pred, conf_matrix_train, conf_matrix_test)
Training Set:
    Accuracy: 0.913
    Recall: 0.909
    Precision: 0.907
    F1 Score: 0.908
Testing Set:
    Accuracy: 0.894
    Recall: 0.896
    Precision: 0.887
    F1 Score: 0.890

Classification Report for Training Set:
              precision    recall  f1-score   support

           0       0.93      0.95      0.94       166
           1       0.93      0.92      0.92       351
           2       0.87      0.86      0.86       159

    accuracy                           0.91       676
   macro avg       0.91      0.91      0.91       676
weighted avg       0.91      0.91      0.91       676


Classification Report for Testing Set:
              precision    recall  f1-score   support

           0       0.94      0.90      0.92        52
           1       0.92      0.88      0.90        78
           2       0.80      0.90      0.85        40

    accuracy                           0.89       170
   macro avg       0.89      0.90      0.89       170
weighted avg       0.90      0.89      0.90       170


 Confusion Matrix:
[Heatmaps of the training and testing confusion matrices]
Confusion Matrix for Final Training Set:
 [[158   7   1]
 [  8 323  20]
 [  4  19 136]]
Confusion Matrix for Final Testing Set:
 [[47  4  1]
 [ 1 69  8]
 [ 2  2 36]]

From the above, we can see that:

  • The model achieves an accuracy of 91.3% on the training set, and a slightly lower accuracy of 89.4% on the testing set
  • Overall precision is also ~2% lower on the testing set when compared to the training set
  • Recall on the training set is ~1% higher than on the testing set
  • The F1-score, however, indicates a good balance between recall and precision in the model's predictions
  • Comparing these metrics with those achieved by the SVM model trained on the original scaled data, we see that the original model performed better, with training and testing accuracy, recall, precision and F1-score all close to each other


Overall, the model performs almost equally on the training and test sets, which indicates it is not heavily overfitted. However, given the reduced performance compared to the base classification model, some hyperparameter tuning could improve it

In [54]:
# Adding values to the comparison table for the final comparison of performance
temp_dataframe = pd.DataFrame({'Model': ['SVM Model (PCA with 5 components)'],
                                 'Accuracy(Training Set)': format(accuracy_score(y_train, y_train_pca_90_pred), '.4f'), 
                                 'Accuracy(Testing Set)': format(accuracy_score(y_test, y_test_pca_90_pred), '.4f')})

final_comparison = pd.concat([final_comparison, temp_dataframe])
final_comparison
Out[54]:
Model Accuracy(Training Set) Accuracy(Testing Set)
0 Base Classification SVM Model 0.9867 0.9765
0 SVM Model (PCA with 5 components) 0.9127 0.8941

QUESTION 4¶


*SOLUTION (4 A.)*¶

Now, since we need to train our SVM on the components we got from the 90%-threshold PCA, let's start with the 'X_scaled_pca_90' dataset.

To recap, we obtained X_scaled_pca_90 by applying PCA with the minimum number of components (which came to 5) explaining 90% or more of the variance on our scaled X set (X_scaled)

In [55]:
# printing the X_scaled_pca_90 dataframe
X_scaled_pca_90_df.head()
Out[55]:
PC_90_1 PC_90_2 PC_90_3 PC_90_4 PC_90_5
0 0.616786 -0.656198 -0.588809 0.675499 -0.834970
1 -1.528457 -0.350607 -0.247520 -1.333410 -0.241597
2 4.036602 0.253970 -1.251445 -0.166392 0.964006
3 -1.546115 -3.101297 -0.475628 -0.411329 -0.616389
4 -1.495061 0.933869 -0.118030 1.152771 0.392378
In [56]:
# Let's perform train-test split on this PCA-transformed data
X_scaled_pca_90_train, X_scaled_pca_90_test, y_train, y_test = train_test_split(X_scaled_pca_90,
                                                                                y, test_size = 0.2, random_state = 42)
In [57]:
# Shapes of the resulting datasets
print("X_scaled_pca_90_train shape:", X_scaled_pca_90_train.shape)
print("X_scaled_pca_90_test shape:", X_scaled_pca_90_test.shape)
print("Y_train shape:", y_train.shape[0])
print("Y_test shape:", y_test.shape[0])
X_scaled_pca_90_train shape: (676, 5)
X_scaled_pca_90_test shape: (170, 5)
Y_train shape: 676
Y_test shape: 170

Why do we need to split our data into train and test sets again?

  • If we apply PCA to the whole dataset before dividing it into training and testing parts, information from the testing part can unintentionally influence how PCA fits the training part
  • This data leakage can make the model seem better at predicting new data than it actually is
  • To get a more accurate assessment, it's better to split the data first; that way, PCA only learns from the training data, which keeps the evaluation fair and realistic
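One way to make this split-first discipline hard to get wrong is to wrap scaling, PCA, and the classifier in a single sklearn Pipeline, so that `fit` only ever sees the training fold. A sketch on synthetic stand-in data (the variable names and data here are illustrative, not from this notebook):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vehicle features and labels (illustrative only)
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(200, 18))
y_demo = rng.randint(0, 3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

# fit() scales the data and fits PCA on the training fold only;
# score()/predict() reuse those fitted transforms on the test fold
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=5, random_state=42)),
    ('svm', SVC(random_state=42)),
])
pipe.fit(X_tr, y_tr)
print("Test accuracy:", pipe.score(X_te, y_te))
```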
In [58]:
# Now, let's initialize our new SVM model
svm_tuned = SVC(random_state = 42)
svm_tuned
Out[58]:
SVC(random_state=42)
In [59]:
# Defining parameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf']#, 'poly'],
    #'degree': [2, 3, 4]
}

# Initializing GridSearchCV
grid_search = GridSearchCV(estimator = svm_tuned, param_grid = param_grid,
                           cv = 5, scoring = 'accuracy', verbose = 1, n_jobs = -1)

grid_search
Out[59]:
GridSearchCV(cv=5, estimator=SVC(random_state=42), n_jobs=-1,
             param_grid={'C': [0.1, 1, 10, 100], 'gamma': [0.1, 1, 10, 100],
                         'kernel': ['linear', 'rbf']},
             scoring='accuracy', verbose=1)
In [60]:
# Performing Grid Search to find best parameters

start_time_fit = time.time()
fit_grid_search = grid_search.fit(X_scaled_pca_90_train, y_train)
end_time_fit = time.time()

print("SVM Grid Search Tuning on PCA-transformed data Time:", 
      format((end_time_fit - start_time_fit), '.2f'), "seconds")

print("\n Fitted Model - \n")
fit_grid_search
Fitting 5 folds for each of 32 candidates, totalling 160 fits
SVM Grid Search Tuning on PCA-transformed data Time: 3.93 seconds

 Fitted Model - 

Out[60]:
GridSearchCV(cv=5, estimator=SVC(random_state=42), n_jobs=-1,
             param_grid={'C': [0.1, 1, 10, 100], 'gamma': [0.1, 1, 10, 100],
                         'kernel': ['linear', 'rbf']},
             scoring='accuracy', verbose=1)

*SOLUTION (4 B.)*¶

In [61]:
# Getting the best parameters found by Grid Search
best_parameters = grid_search.best_params_
print("Best Parameters found by Grid Search on PCA-transformed data - \n       ", best_parameters)
Best Parameters found by Grid Search on PCA-transformed data - 
        {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}

Concluding from the grid search, the best parameters to tune our model are:

  • 'C' = 10
  • 'gamma' = 0.1
  • 'kernel' = 'rbf'

We will use these parameters for further predictions

*SOLUTION (4 C.)*¶

In [62]:
best_model = grid_search.best_estimator_
best_model.fit(X_scaled_pca_90_train, y_train)
Out[62]:
SVC(C=10, gamma=0.1, random_state=42)
In [63]:
y_train_pred_tuned = best_model.predict(X_scaled_pca_90_train)
y_test_pred_tuned = best_model.predict(X_scaled_pca_90_test)

print("Train Set Accuracy:", format(accuracy_score(y_train, y_train_pred_tuned), '.5f'))
print("Test Set Accuracy:", format(accuracy_score(y_test, y_test_pred_tuned), '.5f'))
Train Set Accuracy: 0.95266
Test Set Accuracy: 0.90588
In [64]:
# Evaluating the performance of the model on train and test set

conf_matrix_train = confusion_matrix(y_train, y_train_pred_tuned)
conf_matrix_test = confusion_matrix(y_test, y_test_pred_tuned)

evaluation_metrics(y_train, y_train_pred_tuned, y_test, y_test_pred_tuned, conf_matrix_train, conf_matrix_test)
Training Set:
    Accuracy: 0.953
    Recall: 0.950
    Precision: 0.950
    F1 Score: 0.950
Testing Set:
    Accuracy: 0.906
    Recall: 0.903
    Precision: 0.901
    F1 Score: 0.902

Classification Report for Training Set:
              precision    recall  f1-score   support

           0       0.96      0.98      0.97       166
           1       0.96      0.96      0.96       351
           2       0.93      0.92      0.92       159

    accuracy                           0.95       676
   macro avg       0.95      0.95      0.95       676
weighted avg       0.95      0.95      0.95       676


Classification Report for Testing Set:
              precision    recall  f1-score   support

           0       0.96      0.92      0.94        52
           1       0.91      0.91      0.91        78
           2       0.83      0.88      0.85        40

    accuracy                           0.91       170
   macro avg       0.90      0.90      0.90       170
weighted avg       0.91      0.91      0.91       170


 Confusion Matrix:
[Heatmaps of the training and testing confusion matrices]
Confusion Matrix for Final Training Set:
 [[162   3   1]
 [  5 336  10]
 [  1  12 146]]
Confusion Matrix for Final Testing Set:
 [[48  3  1]
 [ 1 71  6]
 [ 1  4 35]]

From the above confusion matrix and evaluation metrics, we can see the following:

  1. Accuracy:
    • Accuracy achieved on the training set (95.3%) is slightly higher than that of the testing set (90.6%)
    • This shows that, besides correctly predicting 95.3% of the training labels, the model also performs well on unseen data


  2. Precision and Recall:
    • Precision and recall values are slightly lower on the testing set when compared to the training set but remain relatively high across all classes (0, 1, 2 - 'bus', 'car', 'van')
    • Precision ranges from 93% to 96% on the training set and from 83% to 96% on the testing set, with recall similarly strong on both sets


  3. F1-Score:
    • The F1-score on the training set is higher than on the testing set; however, both indicate a good balance between precision and recall


  4. Confusion Matrix:
    • It shows that while most predictions on the training set are accurate, there are some misclassifications, particularly in distinguishing between classes 'car' and 'van' (class 1 and class 2)
    • Similar to the training set, the testing set also shows misclassifications concentrated between classes 'car' and 'van'


  5. Overall Insights:
    • SVM model trained on PCA-transformed data shows strong performance metrics across accuracy, precision, recall, and F1 scores, indicating effective learning and generalization capabilities in our dataset
    • Despite strong performance, further tuning might be helpful in potentially improving accuracy, especially in distinguishing between classes that are more challenging to differentiate
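Building on the last point, one option for further tuning is to re-run the search on a finer grid centred on the best parameters found above (C = 10, gamma = 0.1). A sketch of that idea, here on synthetic stand-in data for brevity:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (illustrative only)
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(150, 5))
y_demo = rng.randint(0, 3, size=150)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Zoom in around the coarse optimum found above (C = 10, gamma = 0.1)
fine_grid = {
    'C': [3, 10, 30],
    'gamma': [0.03, 0.1, 0.3],
    'kernel': ['rbf'],
}
fine_search = GridSearchCV(SVC(random_state=42), fine_grid,
                           cv=3, scoring='accuracy', n_jobs=-1)
fine_search.fit(X_tr, y_tr)
print("Best fine-grid parameters:", fine_search.best_params_)
```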
In [65]:
# Adding values to the comparison table for the final comparison of performance
temp_dataframe = pd.DataFrame({'Model': ['SVM Model Tuned (PCA-transformed)'],
                                 'Accuracy(Training Set)': format(accuracy_score(y_train, y_train_pred_tuned), '.4f'), 
                                 'Accuracy(Testing Set)': format(accuracy_score(y_test, y_test_pred_tuned), '.4f')})

final_comparison = pd.concat([final_comparison, temp_dataframe])
In [66]:
final_comparison
Out[66]:
Model Accuracy(Training Set) Accuracy(Testing Set)
0 Base Classification SVM Model 0.9867 0.9765
0 SVM Model (PCA with 5 components) 0.9127 0.8941
0 SVM Model Tuned (PCA-transformed) 0.9527 0.9059

Based on the above accuracy table, we can conclude the following:

  1. Base Classification SVM Model
    • It achieved a high training accuracy of 98.67%, indicating that the model fits the training data very well, while maintaining a high accuracy of 97.65% on the testing set, suggesting strong generalization to unseen data
    • This model shows excellent performance on both training and testing sets, indicating that it effectively learns and predicts the classes in the dataset without overfitting


  2. SVM Model (PCA with 5 components):
    • While it achieved a training accuracy of 91.27%, this is lower than the base SVM model's; performance decreased to 89.41% on the testing set as well
    • The use of PCA with only 5 components reduces the dimensionality of the data but also reduces the amount of explained variance, impacting the model's ability to capture the dataset's variability
    • Lower accuracy on both training and testing sets suggests that the reduced feature space may not provide enough information for the SVM to generalize well
    • It also indicates slight overfitting


  3. SVM Model Tuned (PCA-transformed):
    • There is an improvement in training accuracy (95.27%) over the previous model, along with improved performance on the testing set (90.59%), indicating better generalization than the untuned PCA model
    • We see that tuning the SVM model on PCA-transformed data has improved accuracy compared to the basic PCA approach
    • The higher accuracy on both training and testing sets suggests that the model better leverages the reduced but informative features obtained from PCA, combined with optimized hyperparameters
    • However, there is still a slight gap between the training and testing accuracies, indicating some remaining overfitting after tuning; the model may be slightly biased towards the training data


Overall Comparison

  • The base SVM model without PCA achieves the highest accuracies on both training and testing sets, indicating superior performance in our case
  • While PCA-based models reduce dimensionality, they also compromise on accuracy to varying degrees. Tuning the SVM on PCA-transformed data mitigates some of these drawbacks but still does not match the base SVM model, which performs well without overfitting, unlike the other two models



The Overall Variance Explained by 10 PCA components is 0.9888 which is much higher than the Overall Variance when Explained by minimum number (5) of PCA components which stands at 0.9139
While the former scenario demonstrates that using more components can capture nearly all the data's variability, resulting in better performance at the expense of higher dimensionality, the latter scenario balances dimensionality reduction and information retention. This balance makes it suitable for scenarios where computational efficiency is critical, although it may sacrifice some accuracy

In conclusion, selecting between these models involves balancing accuracy with computational efficiency, particularly for PCA-based models. The base SVM model excels in accuracy, whereas PCA-based models provide the advantage of dimensionality reduction but it might require further careful tuning to achieve optimal performance
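One practical way to handle that accuracy-versus-efficiency trade-off is to tune the number of PCA components and the SVM hyperparameters jointly inside a single pipeline. The sketch below uses synthetic classification data as a stand-in for the vehicle dataset; the parameter grid values are illustrative, not the ones used above:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in for the vehicle data (3 classes, 18 features)
X, y = make_classification(n_samples=600, n_features=18, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale -> reduce -> classify, so PCA sees standardized features
pipe = Pipeline([("scale", StandardScaler()),
                 ("pca", PCA()),
                 ("svm", SVC())])

# Search over component count and SVM hyperparameters together
param_grid = {"pca__n_components": [5, 10],
              "svm__C": [1, 10],
              "svm__gamma": ["scale", 0.1]}
grid = GridSearchCV(pipe, param_grid, cv=3).fit(X_tr, y_tr)

test_acc = grid.score(X_te, y_te)
print(grid.best_params_, round(test_acc, 3))
```

Letting `GridSearchCV` choose `pca__n_components` avoids committing to 5 or 10 components up front.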

QUESTION 5¶


*SOLUTION (5 A.)*¶

Principal Component Analysis (PCA) is a powerful technique for dimensionality reduction and feature extraction in data analysis.


Prerequisites/Assumptions of PCA

  • Numerical Data:
      PCA requires numerical data since it relies on the calculation of means, variances, and covariances


  • Scaling of Variables:
      If the variables have different units or scales, standardization (scaling to unit variance) is often performed before applying PCA. This ensures that all variables contribute equally to the analysis


  • Sensitivity to Outliers:
      PCA is sensitive to outliers, as they can disproportionately influence the calculation of means and covariances. Therefore, it is advisable to identify and handle outliers, especially when they occur in large numbers, before applying PCA


  • Linearity:
      PCA assumes that the relationships between the variables are linear. This means that PCA is most effective when the data can be approximated well by a linear combination of features.


  • Large Sample Size:

    • For PCA to be effective, a sufficiently large sample size is important
    • This ensures that the covariance matrix is stable and the principal components derived are reliable



  • Independence of Principal Components:
      The principal components are uncorrelated to each other. This is a fundamental property of PCA


  • Variance as an Indicator of Importance:
      PCA assumes that components with higher variance are more important. The method seeks to maximize the variance captured by the principal components


  • Multicollinearity:
      PCA is particularly useful when there is multicollinearity in the data, as it transforms the original correlated variables into a set of uncorrelated principal components
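The scaling assumption above can be demonstrated directly. In this sketch, two hypothetical features on very different scales are compared: without standardization the large-scale feature dominates the first principal component almost entirely, while after standardization the variance is shared evenly:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Two independent features on very different scales
X = np.column_stack([rng.normal(0, 1, 500),      # unit scale
                     rng.normal(0, 1000, 500)])  # large scale

# Without scaling: PC1 is almost entirely the large-scale feature
pc1_raw = PCA(n_components=1).fit(X).explained_variance_ratio_[0]

# With scaling: both features contribute roughly equally
X_scaled = StandardScaler().fit_transform(X)
pc1_scaled = PCA(n_components=1).fit(X_scaled).explained_variance_ratio_[0]

print(round(pc1_raw, 3), round(pc1_scaled, 3))
```

Here `pc1_raw` is close to 1.0 while `pc1_scaled` is close to 0.5, which is why standardization precedes PCA throughout this project.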


*SOLUTION (5 B.)*¶

Advantages of PCA

  • Dimensionality Reduction:
      PCA reduces the number of dimensions (variables) in a dataset while retaining the most important information. This simplifies the dataset and can help in visualizing high-dimensional data


  • Uncorrelated Features:
      The principal components are orthogonal (uncorrelated) to each other. This property can be useful in situations where multicollinearity is an issue, such as in regression analysis


  • Noise Reduction:
      By focusing on the principal components that capture the most variance, PCA can help reduce noise and irrelevant information: small background variations are discarded automatically along with the low-variance components


  • Data Compression:
     PCA can be used for data compression by reducing the dimensionality of the dataset while retaining most of the original information. This can save storage space and computational resources.


  • Visualization:
      PCA can transform high-dimensional data into a lower-dimensional space (usually 2D or 3D), making it easier to visualize and understand patterns and relationships in the data.


  • Feature Extraction:
      PCA helps in identifying the most important features that contribute to the variance in the data. These new features (principal components) can be used for further analysis and modeling.
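The uncorrelated-features advantage can be checked numerically. The sketch below builds four strongly correlated synthetic features (a multicollinearity scenario), transforms them with PCA, and verifies that the correlations between the resulting component scores are essentially zero:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
# Four features that are noisy copies of one latent variable (multicollinear)
base = rng.normal(size=(300, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(300, 1)) for _ in range(4)])

# Transform to principal-component scores
Z = PCA().fit_transform(X)

# Off-diagonal correlations between component scores should be ~0
corr = np.corrcoef(Z, rowvar=False)
off_diag = corr - np.diag(np.diag(corr))
print(np.max(np.abs(off_diag)))
```

This is why feeding principal components into a regression model sidesteps multicollinearity problems.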


Disadvantages/Limitations of PCA

  • Linearity Assumption:
      PCA assumes linear relationships among variables. It may not capture complex, non-linear relationships in the data, which can limit its effectiveness in some applications


  • Interpretability:
      The principal components are linear combinations of the original variables and may not have a clear or intuitive interpretation, making it difficult to understand the transformed features.


  • Variance-Based Focus:
      PCA focuses on capturing the maximum variance in the data. However, high variance does not always correspond to the most important or meaningful features, especially in cases where the underlying structure is not driven by variance


  • Sensitivity to Scaling:
      PCA is sensitive to the scaling of variables. If the variables have different units or scales, they need to be standardized before applying PCA; otherwise, the results may be biased towards variables with larger scales
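The linearity limitation is easy to illustrate with a toy example. Points on a circle form a one-dimensional non-linear structure embedded in 2-D, yet PCA sees no preferred linear direction: both components explain about half the variance each, so no dimension can be dropped. Non-linear methods (e.g. kernel PCA) would be needed here:

```python
import numpy as np
from sklearn.decomposition import PCA

# A 1-D non-linear structure (a circle) embedded in 2-D
theta = np.linspace(0, 2 * np.pi, 400, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])

# By symmetry, each principal component explains ~50% of the variance,
# so linear PCA cannot compress this data despite its 1-D structure
ratios = PCA().fit(X).explained_variance_ratio_
print(np.round(ratios, 3))
```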